Black Friday Dataset Analysis and Visualization

MSDS 7331 Data Mining - Section 403 - Lab 1

Group 3: Bhuvana Adnan Kaur, Selwyn Samuel, Anjli Solsi

Summary

The primary focus of this exploratory analysis is to better understand the Black Friday dataset. This involves examining each variable individually and the relationships between multiple variables. This notebook begins by explaining the data at a high level and what it consists of. It then discusses the issues identified in the dataset and how the team proposes to resolve them. A large chunk of the analyses are different visualizations demonstrating the individual attributes and relationships between them. This analysis ends with a look into new ideas and visualizations of the data.

Business Understanding

The day after Thanksgiving is commonly known as Black Friday. On Black Friday, many stores and retailers open extremely early and offer heavy promotional sales. Although Black Friday is known to offer the best prices on numerous products, it is very chaotic and sometimes dangerous, with crowds gathering hours and days in advance to snag such deals. Thus, there is no doubt Black Friday is generally the busiest shopping day of the year.

This dataset is originally from a competition hosted by Analytics Vidhya and was obtained from Kaggle. The dataset can be found at this link: https://www.kaggle.com/mehdidag/black-friday. The original competition revolved around a company wanting to understand the customer purchase behavior against various products of different categories. It focused specifically on building a model to predict the purchase amount against various products for each customer to help create personalized offers. However, Kaggle presents the data in a way that opens up the type of analysis that can be performed.

Future Work

There are various possibilities on outcomes from the dataset. For regression, the focus will be predicting the purchase amount. The original competition on Analytics Vidhya evaluates the predictions submitted against the actual purchase amount, scoring the result on the root mean squared error (RMSE). This will demonstrate the prediction error in the regression model created. For future projects concerning classification and clustering, a success rate goal will be pre-determined by the team.

In [88]:
# Load the Black Friday dataset
import pandas as pd
import numpy as np

# Vizualisation
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


bf_df = pd.read_csv('../data/BlackFriday.csv')
bf_df.head()
Out[88]:
User_ID Product_ID Gender Age Occupation City_Category Stay_In_Current_City_Years Marital_Status Product_Category_1 Product_Category_2 Product_Category_3 Purchase
0 1000001 P00069042 F 0-17 10 A 2 0 3 NaN NaN 8370
1 1000001 P00248942 F 0-17 10 A 2 0 1 6.0 14.0 15200
2 1000001 P00087842 F 0-17 10 A 2 0 12 NaN NaN 1422
3 1000001 P00085442 F 0-17 10 A 2 0 12 14.0 NaN 1057
4 1000002 P00285442 M 55+ 16 C 4+ 0 8 NaN NaN 7969

Data Meaning Type

The Black Friday dataset consists of 537,577 records and 12 attributes.

Below is a table consisting of all the variables in the dataset. The table includes the variable name, type of data, scale, and a description.

It is important to note that various variables true meaning has been masked from the dataset due to confidentiality, as indicated by (Masked) below. The variables will continue to be included in the analysis.

Variable Type Scale Values Description
User_ID int ordinal [(1000001,...,1006040), 5891 total] Unique customer identification number
Product_ID object ordinal [(P00000142,...,P0099942), 3623 total] Unique product identification number
Gender object nominal (0, 1) Sex of user (M or F)
Age object interval [(0-17,18-25,26-35,36-45,46-50,51-55,55+) Range of customer age
Occupation int nominal (0,1,2,3,4,5,6,7,8,9,10,11,12,...,20) Job role (Masked)
City_Category object nominal (A, B, C) City customer resides in
Stay_In_Current_City_Years object nominal (0, 1, 2, 3, 4+) Number of years living in city
Marital_Status int nominal (0, 1) Married (Yes or No)
Product_Category_1 int nominal (1,2,3,4,5,6,7,8,9,10,...,18) Main product category (Masked)
Product_Category_2 float nominal (NaN, 2,3,4,5,6,7,8,9,...,18) Product may belong to second category (Masked)
Product_Category_3 float nominal (NaN, 3,4,5,6,8,9,10,...,18) Product may belong to third category (Masked)
Purchase int ratio [(185,...,23948), 17959 total] Amount spent on purchase (USD)
In [92]:
bf_df.shape
Out[92]:
(537577, 12)

The Black Friday dataset contains 537577 observations and 12 attributes.

In [93]:
print (bf_df.dtypes)
User_ID                         int64
Product_ID                     object
Gender                         object
Age                            object
Occupation                      int64
City_Category                  object
Stay_In_Current_City_Years     object
Marital_Status                  int64
Product_Category_1              int64
Product_Category_2            float64
Product_Category_3            float64
Purchase                        int64
dtype: object

Total number of Unique User Ids in the Black Friday dataset

In [94]:
uniqueCount = pd.value_counts(bf_df.User_ID)
uniqueCount
uniqueCountList = pd.Series({'Unique User Ids': len(uniqueCount), 'Unique User Id Values': uniqueCount.index.tolist()})
uniqueCountList
Out[94]:
Unique User Ids                                                       5891
Unique User Id Values    [1001680, 1004277, 1001941, 1001181, 1000889, ...
dtype: object

Total number of Unique Product Ids in the Black Friday dataset

In [95]:
uniqueCount = pd.value_counts(bf_df.Product_ID)
uniqueCount
uniqueCountList = pd.Series({'Unique Product Ids': len(uniqueCount), 'Unique Product Id Values': uniqueCount.index.tolist()})
uniqueCountList
Out[95]:
Unique Product Ids                                                       3623
Unique Product Id Values    [P00265242, P00110742, P00025442, P00112142, P...
dtype: object

Total Number of unique age group in the Blak Friday Dataset

In [96]:
uniqueCount = pd.value_counts(bf_df.Age)
uniqueCount
uniqueCountList = pd.Series({'Unique Age Groups': len(uniqueCount), 'Unique Age Group Values': sorted(uniqueCount.index.tolist())})
uniqueCountList
Out[96]:
Unique Age Groups                                                       7
Unique Age Group Values    [0-17, 18-25, 26-35, 36-45, 46-50, 51-55, 55+]
dtype: object
In [97]:
uniqueCount = pd.value_counts(bf_df.Product_Category_1)
uniqueCount
uniqueCountList = pd.Series({'Unique Product_Category_1': len(uniqueCount), 'Unique Product_Category_1 Values': sorted(uniqueCount.index.tolist())})
uniqueCountList
Out[97]:
Unique Product_Category_1                                                          18
Unique Product_Category_1 Values    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
dtype: object
In [98]:
uniqueCount = pd.value_counts(bf_df.Stay_In_Current_City_Years)
uniqueCount
uniqueCountList = pd.Series({'Unique Stay_In_Current_City_Years': len(uniqueCount), 'Unique Stay_In_Current_City_Years Values': sorted(uniqueCount.index.tolist())})
uniqueCountList
Out[98]:
Unique Stay_In_Current_City_Years                          5
Unique Stay_In_Current_City_Years Values    [0, 1, 2, 3, 4+]
dtype: object

Data Quality

Duplicate Data

User_ID identifies the customer who purchased the products. There are multiple rows for the same user, which means the same user has done multiple purchases. This is not a mistake, as more records per customer provide more data to use in developing models. We can preserve the attribute for analysis and predictions instead of dropping the attribute.

Missing Values

Based on the information below, there are missing values for two attributes (Product_Category_2 and Product_Category_3). Product Category 2 is missing 166,986 values and Product Category 3 is missing 373,299 values. Since those values are for extra classifications of product category and have been assigned NaN, it denotes that there were no items purchased in those categories on those observations. We chose to assign a value of 0 to all the missing values in the two columns. To be consistent, we will convert the types of those product categories to integer as well. The 0 is a placeholder and does not affect the integrity of the data. There are Product_Category_1 values for every record, so we made the decision to move forward this way.

After replacing the NaNs with 0, the data was checked to verify there are no missing values.

Outliers

Upon analysis of the variables, the Purchase attribute contains outliers. It is important to note that the majority of the variables are categorical and will not contain outliers. Upon examining the boxplot of the purchase amount, the values above 21,500 are considered outliers. The minimum value for Purchase is 185 and the maximum is 23,961. Since the outliers are not significantly larger, the analysis will be performed with and without the values to determine the best model.

In [100]:
print (bf_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 537577 entries, 0 to 537576
Data columns (total 12 columns):
User_ID                       537577 non-null int64
Product_ID                    537577 non-null object
Gender                        537577 non-null object
Age                           537577 non-null object
Occupation                    537577 non-null int64
City_Category                 537577 non-null object
Stay_In_Current_City_Years    537577 non-null object
Marital_Status                537577 non-null int64
Product_Category_1            537577 non-null int64
Product_Category_2            370591 non-null float64
Product_Category_3            164278 non-null float64
Purchase                      537577 non-null int64
dtypes: float64(2), int64(5), object(5)
memory usage: 49.2+ MB
None
In [104]:
# Replace NaNs with 0s in all columns
bf_df.fillna(0, inplace=True)

# Change the data type of Product_Category_2 and Product_Category_3
bf_df["Product_Category_2"] = bf_df["Product_Category_2"].astype(np.int64)
bf_df["Product_Category_3"] = bf_df["Product_Category_3"].astype(np.int64)

bf_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 537577 entries, 0 to 537576
Data columns (total 12 columns):
User_ID                       537577 non-null int64
Product_ID                    537577 non-null object
Gender                        537577 non-null object
Age                           537577 non-null object
Occupation                    537577 non-null int64
City_Category                 537577 non-null object
Stay_In_Current_City_Years    537577 non-null object
Marital_Status                537577 non-null int64
Product_Category_1            537577 non-null int64
Product_Category_2            537577 non-null int64
Product_Category_3            537577 non-null int64
Purchase                      537577 non-null int64
dtypes: int64(7), object(5)
memory usage: 49.2+ MB
In [105]:
sns.boxplot(bf_df["Purchase"])
Out[105]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d1ae0de6a0>

Simple Statistics

The following table shows the basic statistics for continuous variables. In the table below, the only continuous variable is the Purchase column. This information is also referenced when discussing outliers. I find it important to note that although the mean is 9333.86, the range of values is from 18 to 23961. It will be interesting to see the relationships with the categorial variables.

In [106]:
# Get statistics for the continuous attributes
bf_df.describe()
Out[106]:
User_ID Occupation Marital_Status Product_Category_1 Product_Category_2 Product_Category_3 Purchase
count 5.375770e+05 537577.00000 537577.000000 537577.000000 537577.000000 537577.000000 537577.000000
mean 1.002992e+06 8.08271 0.408797 5.295546 6.784907 3.871773 9333.859853
std 1.714393e+03 6.52412 0.491612 3.750701 6.211618 6.265963 4981.022133
min 1.000001e+06 0.00000 0.000000 1.000000 0.000000 0.000000 185.000000
25% 1.001495e+06 2.00000 0.000000 1.000000 0.000000 0.000000 5866.000000
50% 1.003031e+06 7.00000 0.000000 5.000000 5.000000 0.000000 8062.000000
75% 1.004417e+06 14.00000 1.000000 8.000000 14.000000 8.000000 12073.000000
max 1.006040e+06 20.00000 1.000000 18.000000 18.000000 18.000000 23961.000000

This plot explores Gender and indicates there are more male customers making purchases than female customers. This will be useful information as relationships between attributes are explored later.

In [107]:
sns.countplot(bf_df.Gender)
Out[107]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d1ab547780>

This plot explores Marital_Status and indicates unmarried customers are making more purchases than married customers. This could be due to a variety of reasons, such as age, income, and occupation. It will be interesting to see the relationship with the other attributes.

In [108]:
sns.countplot(bf_df.Marital_Status)
Out[108]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d1aa38d278>

This plot explores City_Category and indicates the most purchases are made from customers that reside in city B. The purchases mabe by customers in City A and C are close, however, those in City C make the second most purchases. It will be interesting to see the correlation between city and purchase amount.

In [109]:
sns.countplot(bf_df.City_Category)
Out[109]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d1968f0780>

This plot explores Occupation and indicates that there are a variety of occupations making a high number of purchases. The ocupations of customers that stand out as making the most purchases are occupation 4, 0, and 7. It will be intersting to see the relationship between occupation and total or average purchase amount.

In [110]:
sns.countplot(bf_df.Occupation)
Out[110]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d1af9b95c0>

This plot explores Product_Category_1 as that is indicated as the main product category. This shows that products in the cateogries 5, 1, and 8 are the most popular, being purchased more than the other categories. It will be intersting to see the relationship between product category 1 and purchase amount as well as age to gain further insight.

In [111]:
sns.countplot(bf_df.Product_Category_1)
Out[111]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d180d75c18>

Visualize Attributes

The pie plot below clearly displays the Gender variable, showing the percentage of unique female and male customers. This is the best visualization for the information, because it demonstrates there are significantly more male customers than female customers. To be precise, there are 1666 female customers and 4225 male customers in the dataset.

In [112]:
print("Total Number of Purchase Transactions by Male Customers : ",len(bf_df[bf_df.Gender=='M']))
print("Total Number of Purchase Transactions by Female Customers : ",len(bf_df[bf_df.Gender=='F']))
print("Total Number of Unique Female & Male User Ids in Dataset : ",bf_df.groupby('Gender')['User_ID'].nunique())
plt.pie(bf_df.groupby('Gender')['User_ID'].nunique(),labels=['Female','Male'],shadow=True,autopct='%1.1f%%',
       colors=['pink','steelblue'])
plt.title('Percentage of Unique Female & Male Customers')
plt.show()
Total Number of Purchase Transactions by Male Customers :  405380
Total Number of Purchase Transactions by Female Customers :  132197
Total Number of Unique Female & Male User Ids in Dataset :  Gender
F    1666
M    4225
Name: User_ID, dtype: int64

The below pie plot also examines Gender, but looks at the total purchase amount by gender. It seems to agree with the data above, that since there are more male customers, male customers will spend more money than female customers. Again, this plot clearly shows the differentiation between the two values of the variable. Male customers are responsible for 76.8 percent of the total purchase amount, while female customers are responsible for 23.2 percent.

In [114]:
print(bf_df.groupby('Gender')['Purchase'].sum())
plt.pie(bf_df.groupby('Gender')['Purchase'].sum(),labels=['Female','Male'],shadow=True,
       autopct='%1.1f%%',colors=['pink','steelblue'])
plt.title('Total Purchase by Gender')
plt.show()
Gender
F    1164624021
M    3853044357
Name: Purchase, dtype: int64

From the above pie chart it is clear that
28.3% of Females contributed 23.2% of total purchase.
71.7% Male contributed 76.8% of total purchase.

In [115]:
priceRange = [20000,15000,10000,5000,3000,1000]

for price in priceRange:
    print("Total % of Purchase where Price Range >= ",str(price))
    print(len(bf_df[bf_df.Purchase>=price])/len(bf_df)*100.0)
Total % of Purchase where Price Range >=  20000
2.3291547071396286
Total % of Purchase where Price Range >=  15000
20.257749122451298
Total % of Purchase where Price Range >=  10000
34.738279353469366
Total % of Purchase where Price Range >=  5000
83.36387159420883
Total % of Purchase where Price Range >=  3000
93.09550073756876
Total % of Purchase where Price Range >=  1000
98.68967608361221

This analysis examines the total percent of purchases that fall in the defined price ranges. It is intersting that 83.4 percent of the purchases are greater than or equal to 5,000 dollars, and the percentage drops to 34.7 percent of the purchases being greater than or equal to 10,000 dollars. It is also important to note that total percent of purchases greater than or equal to 1,000 dollars is 98.7, which is a large majority.

In [116]:
import plotly.offline as py
import plotly.figure_factory as ff
import plotly.graph_objs as go
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
py.init_notebook_mode(connected=True)
from matplotlib import pyplot as plt

from plotly import tools 
#Purchase amount in USD by City categories 
labels = sorted(bf_df.City_Category.unique())
values = bf_df.groupby("City_Category").sum()["Purchase"]
colors = ['#E74B2C', '#26C05E', '#26A6CO']

trace = go.Pie(labels=labels, values=values,
               hoverinfo='label+percent', textinfo='value', 
               textfont=dict(size=20),
               title = 'Purchase Amount by City',
               marker=dict(colors=colors, 
                           line=dict(color='#000000', width=2)))
iplot([trace])

This plot displays the purchase amount for the different cities. A pie plot is the most appropriate, clearly demonstrating how much of the total comes from the respective cities. City B is seen to spend the most, which relates to the above bar plot that shows the customers in city B are responsible for making the highest number of purchases.

In [117]:
#Most Purchased Product Category by Counts
data = [go.Bar(
            x=bf_df.Product_Category_1.value_counts().sort_index().index,
            y=bf_df.Product_Category_1.value_counts().sort_index().values
    )]
layout = go.Layout(
    title='Most Purchased Product Category',
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

This bar plot indicates that the most purchased products belong to category 5. A bar plot clearly defines the purchases for each of the categories. The most number of purchases for category 5 is also equivalent to approximately 148,000 purchases.

In [118]:
sns.distplot(bf_df['Purchase'], hist=True, rug=True)
Out[118]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d18155f240>

The plot below clearly shows the distribution of purchase amount, and the line of fit helps us see the pattern. The highest peak seems to be around 8,000 dollars by looking at the plot. The mean purchase amount is 9,333 dollars, which seems to agree with this plot. The purchase amounts seem to range between 0 and 15,000 dollars for the most part, with other peaks around 16,000 and 20,000 dollars. It will be interesting to see how this relates to the other attributes.

In [119]:
import plotly.offline as py 
import plotly.graph_objs as go

from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
py.init_notebook_mode(connected=True)

%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
sns.set()

#sns.pairplot(bf_df, hue="Gender", height=2)

Pair plots are not very helpful because of the volume of data

Purchase Amount by Age Group

In [120]:
male_observations = bf_df[bf_df['Gender'] == 'M']
female_observations = bf_df[bf_df['Gender'] == 'F']

male_purchases_by_age = bf_df.groupby(by=['Age','Gender'])
print(male_purchases_by_age.Purchase.sum())
female_purchases_by_age = female_observations.groupby("Age").Purchase
Age    Gender
0-17   F           41826615
       M           90832391
18-25  F          202209450
       M          699459830
26-35  F          433857680
       M         1565891426
36-45  F          239010480
       M          771639085
46-50  F          114796993
       M          298621230
51-55  F           87972407
       M          273935949
55+    F           44950396
       M          152664446
Name: Purchase, dtype: int64

This analysis shows the purchase amount by gender per age group. This data will be better visualized in the next section.

Explore Joint Attributes

Visualize relationships between attributes: Look at the attributes via scatter plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate. Explain any interesting relationships.

Percentage of purchases by Gender and Age Group

In [121]:
male_purchase_percentages = male_observations.groupby("Age").Purchase.sum() / bf_df.Purchase.sum() * 100
female_purchase_percentages = female_observations.groupby("Age").Purchase.sum() / bf_df.Purchase.sum() * 100

trace0 = go.Scatter(x = male_purchase_percentages.index, 
                    y = male_purchase_percentages, 
                    name = 'Male',
                    mode = 'lines+markers')

trace1 = go.Scatter(x = female_purchase_percentages.index, 
                    y = female_purchase_percentages, 
                    name = 'Female',
                    mode = 'lines+markers')

data = [trace0, trace1]
layout = dict(title = 'Percentage of Purchases by Gender & Age Groups [1]',
              xaxis = dict(title = 'Age Group'),
              yaxis = dict(title = 'Purchase Percentage (%)')
             )

figure = dict(data=data, layout=layout)
iplot(figure)

Based on the line chart, we observe that males have made more purchases than females, which agrees with the visualizations created above. Also, the age group of customers ranging from 26-35 years old has made the most purchases. For this age group, males have contributed 31 percent of the purchase amount and females have contributed 8.6 percent. The most purchases are made in the age range of 18-45, which tends to be a major chunk of the working age. The percent of purchases decreases as the age range passes 45. This may be due to being settled down. It is important to note that females account for a smaller percent of purchases across all age groups

Percentage of Purchase Count by Product Category 1 and Age Group

In [122]:
#prdctCat1_purchases_by_age = bf_df.groupby(by=['Age','Product_Category_1'])
#print(prdctCat1_purchases_by_age.Purchase.sum())
In [123]:
outdflist =[]
cat1Values = sorted(pd.value_counts(bf_df.Product_Category_1).index.tolist())
for cat1 in cat1Values:
    # create a subdf with desired columns:
    subdf = bf_df[bf_df['Product_Category_1'] == cat1]
    newSubDf = subdf[['Product_Category_1','Age','Purchase']].copy()
    outdflist.append(newSubDf)

outdfCntlist =[]
for df in outdflist:
    dfCnt = df.groupby("Age").Purchase.count() / bf_df.Purchase.count() * 100
    outdfCntlist.append(dfCnt)

traceList = []
index=0
for cnt in outdfCntlist:
    index+=1;
    trace = go.Scatter(x = cnt.index, 
                    y = cnt, 
                    name = 'PrdctCat1 = '+str(index),
                    mode = 'lines+markers')
    traceList.append(trace)

layout = dict(title = 'Percentage of Purchase Count by Product Category1 & Age Group',
              xaxis = dict(title = 'Age Group'),
              yaxis = dict(title = 'Purchase Count Percentage (%)')
             )

figure = dict(data=traceList, layout=layout)
iplot(figure)

From the chart it looks like Product Category 1 = "5" has been purchased more across all age groups, but the age group 55+ seems to have purchased more products which belong to category 8. This may be due to a shift in priorities at different ages. Again, the largest percentage of purchase count is in the age range of 26-35, with the largest categories being 1 and 5.

Average purchases by Marital Status and Age Group

In [124]:
# The dataset does not contain any data in the married age group 0-17 which could be because of the age limit for marriages in 
# the US. We wii exclude that age group to make the plot look better.

married_observations = bf_df[bf_df['Marital_Status'] == 1]
unmarried_observations = bf_df[(bf_df['Marital_Status'] == 0) & (bf_df['Age'] != '0-17')]

married_purchase_averages = married_observations.groupby("Age").Purchase.mean()
unmarried_purchase_averages = unmarried_observations.groupby("Age").Purchase.mean()

trace0 = go.Scatter(x = married_purchase_averages.index, 
                    y = married_purchase_averages, 
                    name = 'Married',
                    mode = 'lines+markers')

trace1 = go.Scatter(x = unmarried_purchase_averages.index, 
                    y = unmarried_purchase_averages, 
                    name = 'Unmarried',
                    mode = 'lines+markers')

data = [trace0, trace1]
layout = dict(title = 'Average Purchase Amount by Marital Status & Age Groups',
              xaxis = dict(title = 'Age Group'),
              yaxis = dict(title = 'Average Purchase ($)')
             )

figure = dict(data=data, layout=layout)
iplot(figure)

Based on the plot, customers in the 46-50 age group spend the least amount on purchases. For customers in the age range of 26-35 there is a constant purchase average, irrespective of marital status. Similarly, for the age range of 18-25, there is not a large difference based on marital status. We found the decline in average purchases from age 36-45 to 46-50 interesting, especially considering the incline from age 46-50 to 51-55. Customers in the age range 51-55, both married and unmarried, have the highest purchase amount average. This is fascinating, since we have seen before that the customers between 26-35 have the highest purchase amount total.

Explore Attributes and Class

Identify and explain interesting relationships between features and the class you are trying to predict (i.e., relationships with variables and the target classification).

In [125]:
bf_df['Age'].value_counts()
Out[125]:
26-35    214690
36-45    107499
18-25     97634
46-50     44526
51-55     37618
55+       20903
0-17      14707
Name: Age, dtype: int64

The above data shows the unique number of customer age category from the black friday dataset. As seen above the age group 26-35 contributed to more sales/ made more purchase. One possible reason is this is group where more individuals would belong to active working group as well.

In [126]:
bf_df['Product_ID'].value_counts().nlargest(5)
Out[126]:
P00265242    1858
P00110742    1591
P00025442    1586
P00112142    1539
P00057642    1430
Name: Product_ID, dtype: int64
In [127]:
The above are the top 5 most popular or frequently bought products.
  File "<ipython-input-127-43a19be02aa4>", line 1
    The above are the top 5 most popular or frequently bought products.
            ^
SyntaxError: invalid syntax
In [128]:
prdct1_observations = bf_df[bf_df['Product_ID'] == 'P00265242']
prdct2_observations = bf_df[bf_df['Product_ID'] == 'P00110742']
prdct3_observations = bf_df[bf_df['Product_ID'] == 'P00025442']
prdct4_observations = bf_df[bf_df['Product_ID'] == 'P00112142']
prdct5_observations = bf_df[bf_df['Product_ID'] == 'P00057642']


prdct1_purchase_averages = prdct1_observations.groupby("Age").Purchase.count()
prdct2_purchase_averages = prdct2_observations.groupby("Age").Purchase.count()
prdct3_purchase_averages = prdct3_observations.groupby("Age").Purchase.count()
prdct4_purchase_averages = prdct4_observations.groupby("Age").Purchase.count()
prdct5_purchase_averages = prdct5_observations.groupby("Age").Purchase.count()


trace0 = go.Scatter(x = prdct1_purchase_averages.index, 
                    y = prdct1_purchase_averages, 
                    name = 'Top 1st P00265242',
                    mode = 'lines+markers')

trace1 = go.Scatter(x = prdct2_purchase_averages.index, 
                    y = prdct2_purchase_averages, 
                    name = 'Top 2nd P00110742',
                    mode = 'lines+markers')

trace2 = go.Scatter(x = prdct3_purchase_averages.index, 
                    y = prdct3_purchase_averages, 
                    name = 'Top 3rd P00025442',
                    mode = 'lines+markers')
trace3 = go.Scatter(x = prdct4_purchase_averages.index, 
                    y = prdct4_purchase_averages, 
                    name = 'Top 4th P00112142',
                    mode = 'lines+markers')

trace4 = go.Scatter(x = prdct5_purchase_averages.index, 
                    y = prdct5_purchase_averages, 
                    name = 'Top 5th P00057642',
                    mode = 'lines+markers')

data = [trace0,trace1,trace2,trace3,trace4]
layout = dict(title = 'Purchase Count for Top Products across Age groups',
              xaxis = dict(title = 'Age group'),
              yaxis = dict(title = 'Purchase Count')
             )

figure = dict(data=data, layout=layout)
iplot(figure)
In [129]:
The above chart shows how the purchase count varies for the top 5 products acroos different age groups.
The product Id "P00265242" has been purchased 1858 times remains the top product for all age categories except "0 to 17" and "36 to 45" years age group.
The product Id "P00112142" which is in the top 4th position, remains the second or third most bought product for age categoris between 0 to 35.
But Product "P00025442" which is top 3rd in the list remains the second or third most bought product for age 36 plus.

This data may be useful for product promotion that is targetted to customers based on their age group.
  File "<ipython-input-129-6b14d1f056bf>", line 1
    The above chart shows how the purchase count varies for the top 5 products acroos different age groups.
            ^
SyntaxError: invalid syntax
In [130]:
city_A_observations = bf_df[bf_df['City_Category'] == 'A']
city_B_observations = bf_df[bf_df['City_Category'] == 'B']
city_C_observations = bf_df[bf_df['City_Category'] == 'C']

cityA_purchase_averages = city_A_observations.groupby("Stay_In_Current_City_Years").Purchase.sum()
cityB_purchase_averages = city_B_observations.groupby("Stay_In_Current_City_Years").Purchase.sum()
cityC_purchase_averages = city_C_observations.groupby("Stay_In_Current_City_Years").Purchase.sum()

trace0 = go.Scatter(x = cityA_purchase_averages.index, 
                    y = cityA_purchase_averages, 
                    name = 'City A',
                    mode = 'lines+markers')

trace1 = go.Scatter(x = cityB_purchase_averages.index, 
                    y = cityB_purchase_averages, 
                    name = 'City B',
                    mode = 'lines+markers')

trace2 = go.Scatter(x = cityC_purchase_averages.index, 
                    y = cityC_purchase_averages, 
                    name = 'City C',
                    mode = 'lines+markers')

data = [trace0,trace1,trace2]
layout = dict(title = 'Average Purchase Amount by City Category & Stay_In_Current_City_Years',
              xaxis = dict(title = 'Stay_In_Current_City_Years'),
              yaxis = dict(title = 'Total Purchase ($)')
             )

figure = dict(data=data, layout=layout)
iplot(figure)

The above chart shows the relationship between City, Stay in current City & Purchase. A prominent spike in purchase total is observed when stay in current city is between 0 to 1 year. There is a sharp fall in purchase total for stay between 1 to 2 years. The Purchase total does not rise any further but starts to gradually decrease as stay in current City increase. However, for City category B this trend for purchase total seem to be gradually increase as stay in current city approaches 3 years. This graph, reveals some interesting facts from the data set.

  1. To increase the Total Purchase, the shop could focus more on promotion that targets customers who recently moved to the city.
  2. City category B , seem to have better spending capacity / purchase capacity when compared to others. The gradual increase in Total Purchase for customers with stay in city between 2 to 3 years can be studied further & the observations can be used to device a better plan to increase black Friday sales.

New Features

Are there other features that could be added to the data or created from existing features? Which ones?

In [131]:
def map_PurchaseAmnt(purchase):
    if purchase >= 20000:
        return '>=20K'
    elif purchase >= 15000 and purchase < 20000:
        return '>=15K < 20K'
    elif purchase >= 10000 and purchase < 15000:
        return '>=10K < 15K'
    elif purchase >= 5000 and purchase < 10000:
        return '>=5K < 10K'
    elif purchase >= 3000 and purchase < 5000:
        return '>=3K < 5K'
    elif purchase >= 1000 and purchase < 3000:
        return '>=1K < 3K'
    else: 
        return 'Less Than 1K'
bf_df['PurchaseAmntCategory'] = bf_df['Purchase'].map(map_PurchaseAmnt)

df_groupByPurchaseAmntAndCategry = bf_df.groupby(by=['Product_Category_1','PurchaseAmntCategory'])


outdflist =[]
order1 = ['>=20K','>=15K < 20K','>=10K < 15K','>=5K < 10K','>=3K < 5K','>=1K < 3K','Less Than 1K']
order = {key: i for i, key in enumerate(order1)}
print(order)
amntCatgryValues = sorted(pd.value_counts(bf_df.PurchaseAmntCategory).index.tolist(), key=order1.index, reverse=True)
print(amntCatgryValues)
for amntCatgry in amntCatgryValues:
    # create a subdf with desired columns:
    subdf = bf_df[bf_df['PurchaseAmntCategory'] == amntCatgry]
    newSubDf = subdf[['PurchaseAmntCategory','Product_Category_1','Purchase']].copy()
    outdflist.append(newSubDf)

outdfCntlist =[]
for df in outdflist:
    dfCnt = df.groupby("Product_Category_1").Purchase.sum() 
    #print(dfCnt)
    outdfCntlist.append(dfCnt)

traceList = []
index=0;
for cnt in outdfCntlist:    
    #print(cnt.index.tolist())
    trace = go.Bar(x = cnt, 
                   y = cnt.index,
                   name = amntCatgryValues[index],
                   orientation = 'h',
                   marker = dict(line = dict(width = 1)))
    index +=1
    traceList.append(trace)
layout = go.Layout(height=750,
    barmode='stack',bargap =0.1,title = 'Total Purchase amount by Amount category across Different Product categories',
    xaxis = dict(title = 'Total Purchase amount ($) by Amount category'),yaxis = dict(title = 'Product categories'))
fig = go.Figure(data=traceList, layout=layout)
py.iplot(fig, filename='marker-h-bar')
{'>=20K': 0, '>=15K < 20K': 1, '>=10K < 15K': 2, '>=5K < 10K': 3, '>=3K < 5K': 4, '>=1K < 3K': 5, 'Less Than 1K': 6}
['Less Than 1K', '>=1K < 3K', '>=3K < 5K', '>=5K < 10K', '>=10K < 15K', '>=15K < 20K', '>=20K']

This horizontal bar chart shows the total purchase amount by amount category for all Product Category1 (1 to 18). The amount category is a new column that categorizes the row based on the Purchase amount. Below are the Amount Category created for every row in the dataset.

  1. Less Than 1K
  2. “>=1K < 3K”
  3. “>=3K < 5K”
  4. “>=5K < 10K”
  5. “>=10K < 15K”
  6. “>=15K < 20K”
  7. “>=20K”

As shown above Product Cateory1 =1 contributes to the maximum purchase amount total. Product Category1=1, has the maximum purchase total of 1.9B, the major contribution comes from products which costs between 15 to 20K USD. These products contribute 1.26 B of total purchase which is 66% of total purchase within Product Category1=1.

The next Category with highest total purchase amount is Product Category1=5. The major contribution comes from products which costs between 5 to 10K USD. The above also applies to Product Category1=8 which has the third highest total purchase amount.

Other interesting observation here is, the products which costs more than >=20K USD are observed only in product category 1 = 6,7,10,15,16 but none of these categories have a significantly higher total purchase amount.

Also the products which is less than 1K USD are observed only in product categories1=4,12,13,18. The contribution to overall purchase amount total comes from Products that costs between 15 to 20K USD & 5 to 10K USD.

In [132]:
def map_ageCategory(age):
    if age == '0-17':
        return 'Teen'
    elif age == '18-25':
        return 'Early Twenties'
    elif age == '26-35':
        return 'Late Twenties and Early Thirties'
    elif age == '36-45':
        return 'Mid Thirties and Early Forties'
    elif age == '46-50':
        return 'Late Forties'
    elif age == '51-55':
        return 'Early Fifties'
    else: 
        return 'Late Fifties Plus'
bf_df['AgeCategory'] = bf_df['Age'].map(map_ageCategory)

df_groupByAge = bf_df.groupby(by=['AgeCategory'])
print ("Percentage of Purchase in each age category:")
print (df_groupByAge.Purchase.sum() / bf_df.Purchase.sum() *100)
Percentage of Purchase in each age category:
AgeCategory
Early Fifties                        7.212680
Early Twenties                      17.969886
Late Fifties Plus                    3.938380
Late Forties                         8.239250
Late Twenties and Early Thirties    39.854150
Mid Thirties and Early Forties      20.141817
Teen                                 2.643838
Name: Purchase, dtype: float64
In [133]:
dfAgeCatgry = bf_df.groupby("AgeCategory").Purchase.sum().sort_values()
print(dfAgeCatgry)
trace0 = go.Scatter(
    x=dfAgeCatgry,
    y=dfAgeCatgry.index,
    mode='markers',
    marker=dict(
        color=['rgb(93, 164, 214)', 'rgb(255, 144, 14)',
               'rgb(44, 160, 101)', 'rgb(255, 65, 54)','rgb(25, 55, 65)','rgb(55, 25, 75)','rgb(150, 165, 254)'],
        opacity=[1, 0.8,0.5, 0.6, 0.5,0.6,0.7],
        size=[30,40,50,60,70,75,80],
    )
)

layout = go.Layout(
    title='Purchase total by Age Category',
    xaxis=dict(
        title='Total Purchase',
        gridcolor='rgb(255, 255, 255)',
        gridwidth=2,
    ),
    yaxis=dict(
        title='Age Category',
        gridcolor='rgb(255, 255, 255)',
        gridwidth=2,
    ),
    margin=go.layout.Margin(
    l=250,
    r=50
    ),
    paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)',
)

data = [trace0]
fig = go.Figure(data=data,layout=layout)
py.iplot(fig,filename='bubblechart-color')
AgeCategory
Teen                                 132659006
Late Fifties Plus                    197614842
Early Fifties                        361908356
Late Forties                         413418223
Early Twenties                       901669280
Mid Thirties and Early Forties      1010649565
Late Twenties and Early Thirties    1999749106
Name: Purchase, dtype: int64

This bubble chart shows the total purchase amount by Age categories. The age category is a new column that categorizes the row based on the customer age range. Below are the Age Category created for every row in the dataset.

• Teen
• Late Fifties Plus
• Early Fifties
• Late Forties
• Early Twenties
• Mid Thirties and Early Forties
• Late Twenties and Early Thirties

The bubble chart above shows themagnitude fo Purchase amount total for different Age category. As noticed earlier, the age category "Late Twenties And Early Thirties" has an overall purchase total of 2B. This is 1 billion greater than the next age category "Mid Thirties and Early Forties". This is a significant increase in the purchase amount total between the first & second top age catgories.

Exceptional Work

Provide additional analyses. One idea: implement dimensionality reduction, then visualize and interpret the results.

In [134]:
fig = {
    "data": [
        {
            "type": 'violin',
            "x": bf_df['AgeCategory'] [ bf_df['Gender'] == 'M' ],
            "y": bf_df['Purchase'] [ bf_df['Gender'] == 'M' ],
            "legendgroup": 'M',
            "scalegroup": 'M',
            "name": 'M',
            "box": {
                "visible": True
            },
            "meanline": {
                "visible": True
            },
            "line": {
                "color": 'blue'
            }
        },
        {
            "type": 'violin',
            "x": bf_df['AgeCategory'] [ bf_df['Gender'] == 'F' ],
            "y": bf_df['Purchase'] [ bf_df['Gender'] == 'F' ],
            "legendgroup": 'F',
            "scalegroup": 'F',
            "name": 'F',
            "box": {
                "visible": True
            },
            "meanline": {
                "visible": True
            },
            "line": {
                "color": 'pink'
            }
        }
    ],
    "layout" : {
        "yaxis": {
            "zeroline": False,
        },
        "violinmode": "group"
    }
}


py.iplot(fig, filename = 'violin/grouped', validate = False)

The Violin plots show the Box plot along with Kernal density estimation which is a non-parametric method of estimating the probability density function (PDF) of a continuous random variable. The violin plots for Male & Female that fall under age category "Late Twenties & Early Thirties" has multiple kde values that gives more curves to the plot. The plot for Teens category seem to be more dense around the mean & less dense every where else.

In [ ]: